Abstract

We present a binary routing tree protocol for distributed hash table overlays. Using this
protocol each peer can independently route messages to its parent and two descendants
on the fly without any maintenance, global context, or synchronization. The protocol
is then extended to support tree change notification with similar efficiency. The resulting
tree is almost perfectly dense and balanced, and has O(1) stretch if the distributed hash
table is symmetric Chord. We use the tree routing protocol to overcome the main impediment
for implementation of local thresholding algorithms in peer-to-peer systems - their
requirement for cycle free routing. Direct comparison of a gossip-based algorithm and a
corresponding local thresholding algorithm on a majority voting problem reveals that the
latter obtains superior accuracy using a fraction of the communication overhead.

1. Introduction

In a world of millions of wired devices, in-network computation algorithms provide an 
intriguing alternative to centralization. Where distributed data is abundant and bandwidth is 
limited or costly, some applications can only be implemented distributively. Where adverse 
manipulation and control are a concern, distributed architecture is often preferred over 
a centralized agent. Finally, scaling an algorithm to millions of peers often teaches
important lessons on asynchrony, speculative execution, and the containment of partial 
failure, which prove important to more mundane environments such as grid systems. 

Algorithms for distributed computation in peer-to-peer systems fall into several categories.
Of these, gossip algorithms are possibly the most popular and certainly the most extensively
studied [1, 9]. Local thresholding algorithms [10, 11, 12, 13, 14]
are comparable to gossip based algorithms because both address similar problems and sim- 
ilarly provide a proof for convergence. Local thresholding algorithms are considered by 
far more communication efficient than gossip based algorithms. However, they pose far 
stricter requirements to the underlying routing protocol. A gossip based algorithm basically 
requires an efficient way in which information can be propagated to random destinations. 
In contrast, all known local thresholding algorithms require cycle free routing. Often, work 
on local thresholding algorithms advocates that a routing tree be induced in preprocessing. 
However, the non-trivial complexity of inducing and maintaining the tree in a dynamic 
network has so far rendered local algorithms impractical. 

This work considers the problem of computation in distributed hash-table (DHT) over- 
lays - the de-facto standard architecture in peer-to-peer networks. Gossip algorithms can 
easily be implemented on a DHT: If each peer sends messages to a random peer from its
finger table then in O(log N) messages this information will arrive at a random peer. Local
thresholding can be implemented in a DHT using one of the existing tree routing protocols. 



However, existing tree routing protocols [15, 16, 17] are ill-fit for a local thresholding
algorithm. Because these protocols were developed mainly to reduce message redundancy in
broadcast or convergecast they operate in a top-down or bottom-up manner. Thus, a peer 
cannot send messages to its tree neighbors without the involvement of either the root (in a 
top-down protocol) or its entire subtree (in a bottom-up). 

This paper makes two main contributions to the state-of-the-art: First, it presents new 
binary tree routing and change notification protocols for DHT overlays. This tree routing 
protocol is local and can be used for multi-way communication over the tree, including 
broadcast and convergecast. The effect of peer joining or leaving is also local and can be
detected and notified using no more than six messages that are routed on the tree.
Enabled by the binary tree routing protocol, the second contribution of this paper is a direct
comparison of local thresholding and gossip based algorithms. Our experiments show that 
regardless of system size or properties of the data, local thresholding vastly outperforms 
gossip. The results are so one sided that they call into question the continued relevance of 
gossip algorithms to computation in DHT overlays. 

The rest of this paper is organized as follows: The next section describes the binary tree
routing protocol and the change notification protocol. Section 3 details the implementation
of the two majority voting algorithms. Experiments are described in Section 4 and related
work in Section 5. Finally, Section 6 draws conclusions and poses some further research
problems.

2. Local Binary Tree Routing 

The basic idea of the binary tree routing protocol is to define a mapping of peers to 
a subset of the nodes of a full binary tree. The binary tree can be defined in terms of a 
one-to-one mapping of d-long binary strings, namely addresses, to tree nodes. Then we 
define which peer is mapped to which address. 

Consider a binary tree whose root is the all zero address. Any address other than that of
the root, we divide into three parts: An all zero suffix, which might be empty, the rightmost
set bit, and a prefix, which might be empty as well. An address is therefore encoded as
$p10^k$, where the length of the prefix $p$ is $d - k - 1$. We define that the clockwise
descendant of the address $p10^k$ is the address $p110^{k-1}$ and the counterclockwise
descendant of the address $p10^k$ is $p010^{k-1}$. Addresses ending with a set bit, i.e.,
$p10^0$, have no descendants. For completeness, we denote $10^{d-1}$ the clockwise descendant
of the root. As can be seen in Figure 2.1(a), this mapping is similar, but not identical to,
the textbook implementation of a complete binary tree in an array.

Figure 2.1: Binary Tree routing
(a) Tree mapping to the address space. (b) Binary Tree Routing on DHT




A peer's position in the tree depends on its assigned address space segment. The peer
whose address space segment contains the all zero address takes the root position. Any
other peer takes a position calculated as follows: Let the address space segment of $p_i$
be $(a_{i-1}, a_i]$. Let $p$ be the (possibly empty) common prefix of $a_{i-1}$ and $a_i$, such that
$a_{i-1} = p0X$ and $a_i = p1Y$. Then $p_i$ takes the position $p10^k$, where $k = d - |p| - 1$.
Note that messages routed to the address which is a peer's position will always be accepted
by that peer, because $p0X < p10^k \le p1Y$ places the position inside the peer's own segment.

We conveniently denote the position with which $p_i$ is associated as $pos_i$. We further
denote the clockwise descendant of $pos_i$ as $CW[pos_i]$ and its counterclockwise descendant
as $CCW[pos_i]$. Respectively, if $pos_j = CW[pos_i]$ or $pos_j = CCW[pos_i]$, then we denote
$pos_i = UP[pos_j]$. The functions CW, CCW, and UP can be computed for any $pos_i$ using
bit manipulations.
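
The bit manipulations mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the choice of `D = 8` address bits, the use of `None` for a missing descendant or parent, and the `a_prev >= a` test for a segment that wraps past the all zero address are all assumptions made for the example.

```python
D = 8  # address length d in bits (assumed; matches the eight-bit example)


def low_bit(pos: int) -> int:
    """Value of the rightmost set bit of pos (0 for the all zero root)."""
    return pos & -pos


def cw(pos: int, d: int = D):
    """CW[pos]: p10^k -> p110^(k-1); the root maps to 10^(d-1)."""
    if pos == 0:
        return 1 << (d - 1)
    half = low_bit(pos) >> 1
    return pos + half if half else None  # k = 0: no descendants


def ccw(pos: int, d: int = D):
    """CCW[pos]: p10^k -> p010^(k-1); assumed undefined (None) for the root."""
    if pos == 0:
        return None
    half = low_bit(pos) >> 1
    return pos - half if half else None  # k = 0: no descendants


def up(pos: int, d: int = D):
    """UP[pos]: clear the rightmost set bit and set the bit to its left."""
    if pos == 0:
        return None  # the root has no parent
    low = low_bit(pos)
    parent = (pos & ~low) | (low << 1)
    return parent if parent < (1 << d) else 0  # 10^(d-1) hangs off the root


def position(a_prev: int, a: int) -> int:
    """Position of the peer whose segment is (a_prev, a]."""
    if a_prev >= a:  # assumed convention: the segment wraps, so it holds 0
        return 0
    h = (a_prev ^ a).bit_length() - 1  # first bit below the common prefix p
    return (a >> h) << h  # p, then a set bit, then zeros
```

For instance, `position(0b01110000, 0b10011000)` yields `0b10000000`, the position the text assigns to peer number 5 in Figure 2.1(b).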

The following two lemmas show that if there is more than one peer in the clockwise
(or, respectively, the counterclockwise) subtree of a peer's position then one of those peers
occupies a position which is a fore-parent of the positions of all other peers.

Lemma 1. The address space segment associated with the peers whose positions are in the
subtree below any peer $p_i$ is contiguous.

Proof. See Appendix A □

Lemma 2. For any peer $p_i$ whose position is $pos_i$, one of the peers which occupy positions
in the subtree of position $CW[pos_i]$ (respectively, $CCW[pos_i]$) occupies a position which
is a fore-parent of the positions of all of the other peers in that subtree.

Proof. See Appendix A □



Lemma 2 can be put into direct use to define a binary tree of peers, rather than of positions:
If all peers in the subtree of the address clockwise from $p_i$'s position have positions
which are in the subtree of $p_{cw}$, then we denote $p_{cw}$ the clockwise neighbor of $p_i$. Likewise,
the counterclockwise neighbor of $p_i$ is the peer $p_{ccw}$, whose position is the fore-parent of
all of the positions in the counterclockwise subtree of $pos_i$ that are occupied by peers.

It remains to define how messages can efficiently be routed from a peer to its neighbors
on the tree. The pseudocode of a protocol achieving this is detailed in Alg. 1. To deliver
messages to the UP neighbor of a peer $p_i$, they are first addressed to $UP[pos_i]$ and then
continue being routed to the UP of that address until they reach an address occupied
by a peer. Clockwise and counterclockwise messages are first routed to $CW[pos_i]$ and
$CCW[pos_i]$. If they reach an address not occupied by a peer, this is because the destination
falls in the address space segment of a peer $p_j$ occupying a different position. A new
destination, which is a step down the tree and away from that of $pos_j$, is thus computed. If
the destination address exhausts the address space, the message is dropped.

Forwarding a message again and again until the destination is found or the address space
is exhausted is often wasteful and unnecessary. Whenever the address of the destination
position falls in the address space of a peer who has a different position and whose segment
is adjacent to that of the sender, the message can be dropped. This is because the message
is doomed to be sent back and forth between the sender and the receiver until eventually
being dropped, as there is no peer between them to accept it. Fortunately, such communication
patterns can easily be avoided if the sender denotes as part of the message header the edge
of its address space in the direction in which the message is sent. The recipient can then
compare that to the edge of its own address space segment. If the edges are the same, the
message can be dropped.

Algorithm 1 Local Binary Tree Routing

On downcall to SEND with message M and direction d:
  If d is upward then dest ← $UP[pos_i]$ and edge ← null
  If d is counterclockwise then dest ← $CCW[pos_i]$ and edge ← $a_{i-1}$
  If d is clockwise then dest ← $CW[pos_i]$ and edge ← $a_i$
  Make a downcall to SEND with the destination address dest and the message
  ($pos_i$, dest, edge, M) using the DHT

On upcall DELIVER with the message (origin, dest, edge, M):
  If dest = $pos_i$ then call ACCEPT with the message M and finish.
  If dest is a fore-parent of origin then newdest ← UP[dest] and newedge ← null
  Else if dest is in the clockwise subtree of origin then
    - If edge = $a_{i-1}$ then finish
    - If origin = $pos_i$ then newdest ← CW[dest] and newedge ← $a_i$
    - Else newdest ← CCW[dest] and newedge ← $a_{i-1}$
  Else
    - If edge = $a_i$ then finish
    - If origin = $pos_i$ then newdest ← CCW[dest] and newedge ← $a_{i-1}$
    - Else newdest ← CW[dest] and newedge ← $a_i$
  Make a downcall to SEND with the destination newdest and the message
  (origin, newdest, newedge, M)

Figure 2.1(b) illustrates this address scheme in a DHT composed of just nine peers and
an address space of eight bits. For instance, peer number 5, whose address space segment
is (01110000, 10011000], takes the tree position 10000000, and so forth. Messages routed
counterclockwise from 9 first reach position 11010000, which is in the address space segment
of peer number 7. However, since the position of peer number 7 is 11000000, the
message is then bounced clockwise to position 11011000, which is occupied by peer number 8.
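
The descent described in the text can be sketched as a small simulation. It is a simplified, global-view illustration and not Algorithm 1 itself: it omits the edge-comparison shortcut, it treats the root's position as $2^d$ when the destination falls in the wrapped upper part of the root's segment (an assumption about how "away from" should be read on the ring), and every segment end below other than those of peers 4 and 5 is hypothetical, chosen only to agree with the positions quoted for peers 7, 8 and 9.

```python
from bisect import bisect_left

D = 8  # address bits, as in the example above


def low_bit(x):
    return x & -x


def cw(pos):
    if pos == 0:
        return 1 << (D - 1)
    half = low_bit(pos) >> 1
    return pos + half if half else None


def ccw(pos):
    half = (low_bit(pos) >> 1) if pos else 0
    return pos - half if half else None


def up(pos):
    if pos == 0:
        return None
    low = low_bit(pos)
    parent = (pos & ~low) | (low << 1)
    return parent if parent < (1 << D) else 0


def position(a_prev, a):
    if a_prev >= a:  # segment wraps past zero: root
        return 0
    h = (a_prev ^ a).bit_length() - 1
    return (a >> h) << h


class Overlay:
    """Peers are indexed by the order of their segment ends on the ring."""

    def __init__(self, ends):
        self.ends = sorted(ends)
        n = len(self.ends)
        self.pos = [position(self.ends[i - 1], self.ends[i]) for i in range(n)]

    def owner(self, addr):
        i = bisect_left(self.ends, addr)
        return i if i < len(self.ends) else 0  # beyond the last end: wrap

    def route(self, src, direction):
        """Return (index of accepting peer or None, addresses tried)."""
        step = {"UP": up, "CW": cw, "CCW": ccw}[direction]
        dest, path = step(self.pos[src]), []
        while dest is not None:
            path.append(dest)
            j = self.owner(dest)
            if self.pos[j] == dest:
                return j, path  # dest is occupied: the message is accepted
            if direction == "UP":
                dest = up(dest)
            else:
                owner_pos = self.pos[j]
                if owner_pos == 0 and dest > self.ends[-1]:
                    owner_pos = 1 << D  # root reached through the wrap
                # one step down the tree, away from the owner's position
                dest = cw(dest) if owner_pos < dest else ccw(dest)
        return None, path  # address space exhausted: message dropped


# Peers 1..9 are indices 0..8; only the ends of peers 4 and 5 come from the text.
ENDS = [0b00010000, 0b00110000, 0b01010000, 0b01110000, 0b10011000,
        0b10110000, 0b11010100, 0b11011100, 0b11110000]
overlay = Overlay(ENDS)
```

With these assumed segments, `overlay.route(8, "CCW")` follows the path in the text: the message from peer 9 first tries 11010000, which peer 7 owns but does not occupy, and is then accepted by peer 8 at 11011000.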

2.1. Tree properties 

The properties of the tree which are the most important are its expected maximal depth,
the expected degree of internal nodes and the expected stretch of a hop. The expected
maximal depth is mostly important for broadcast and convergecast applications, in which
a message must traverse the full depth of the tree. The expected degree determines the
expansion rate, which is central for repeated averaging algorithms such as gossip algorithms.
In any algorithm, actual performance is proportional to the stretch - the number of actual
messages needed to deliver a message from a tree node to its parent or its descendant.

Lemma 3. The maximal depth of the tree is O(log N) where N is the number of peers.

Proof. See Appendix A □



The stretch of the tree is defined in terms of the number of times the tree routing pro- 
tocol calculates a new destination for an UP message and lets the DHT route the message. 
Each such message may require up to log N IP messages. However, since the tree closely 
follows the finger table logic, symmetric Chord peers will almost always have a direct link 
to their CW, CCW and UP neighbors. Therefore, the number of IP messages required for 
every DHT routing in symmetric Chord is O (1). Notice that messages in the CW and CCW 
direction follow the same path on the tree as the UP message in the opposite direction, and 
therefore have the same stretch. 

Lemma 4. The expected stretch of the tree is a small constant. 



Proof. See Appendix A □

2.2. Neighbor change notification

The binary tree routing protocol in Alg. 1 defines neighbor relations logically and is
therefore immune to peer dynamics. Whenever peers join or leave the system the protocol
simply reflects the change by delivering messages according to the current tree structure.
However, some algorithms, including the one in the next section, still require explicit
notification when one of the tree neighbors changes.

The neighbor change notification protocol is based on the following property of the
binary tree routing protocol: Let $p_i$ be the successor of $p_{i-1}$ and let their positions be
$pos_i$ and $pos_{i-1}$ respectively. If $p_{i-1}$ leaves the system then the position of $p_i$ either
remains $pos_i$ or changes to $pos_{i-1}$. In the former case, the parent of $p_{i-1}$ becomes the
parent of its single direct descendant, if one exists. In the latter, $p_{i-1}$'s former neighbors
become the new neighbors of $p_i$ and the former parent of $p_i$ becomes the parent of $p_i$'s
former single direct descendant. The same property can prove that it is sufficient to alert
those same five peers when $p_{i-1}$ joins the system.

Algorithm 2 Neighbor Change Notification

Definitions: Pos(a, b) is the position of a peer whose address space segment is (a, b]

On upcall NOTIFY that the predecessor address has changed from $a_{i-2}$ to $a_{i-1}$ or
vice-versa:
  Compute $pos_{fix} = Pos(a_{i-2}, a_i)$ and
    $pos_{var} = Pos(a_{i-1}, a_i)$ if $Pos(a_{i-2}, a_{i-1}) = pos_{fix}$
    $pos_{var} = Pos(a_{i-2}, a_{i-1})$ if $Pos(a_{i-1}, a_i) = pos_{fix}$
  Send the message (ALERT, $pos_{fix}$) in direction UP, CW and CCW from $pos_{fix}$ using
  binary tree routing.
  Send the message (ALERT, $pos_{var}$) in direction UP, CW and CCW from $pos_{var}$ using
  binary tree routing.

On upcall ACCEPT with the message (ALERT, pos):
  If pos is a fore-parent of $pos_i$ then dir ← upward
  Else if pos is in the clockwise subtree of $pos_i$ then dir ← clockwise
  Else dir ← counterclockwise
  Notify the application of a possible change of the neighbor in direction dir

Lemma 5. The addition or removal of a peer $p_i$ can affect the tree connectivity of only
five peers, which are all tree neighbors of either $p_i$ or its successor $p_{i+1}$.

Proof. See Appendix A □



This property can be used to provide alerts on any single local change in topology.
This is because when $p_{i-1}$ leaves or joins the system, the DHT alerts its successor that its
address space segment has changed. Once $p_i$ is informed of the change in the address space
segment it is able to calculate the positions whose neighbors might have changed. Hence,
$p_i$ can route alert messages in all directions from those positions.
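
The two positions from which alerts are routed can be computed from three consecutive segment ends. The sketch below is illustrative only: `position` follows the definition of Section 2, the name `affected_positions` is hypothetical, and the `a_prev >= a` wrap test is an assumed convention for the segment that contains the all zero address.

```python
def position(a_prev: int, a: int) -> int:
    """Pos(a_prev, a): position of the peer whose segment is (a_prev, a]."""
    if a_prev >= a:  # assumed convention: wrapped segment holds address 0
        return 0
    h = (a_prev ^ a).bit_length() - 1
    return (a >> h) << h


def affected_positions(a_i2: int, a_i1: int, a_i: int):
    """Return (pos_fix, pos_var) for segments (a_i2, a_i1] and (a_i1, a_i].

    pos_fix is the position of the merged segment (a_i2, a_i]; it always
    coincides with one of the two split positions, and pos_var is the other.
    """
    pos_fix = position(a_i2, a_i)
    left, right = position(a_i2, a_i1), position(a_i1, a_i)
    pos_var = right if left == pos_fix else left
    return pos_fix, pos_var
```

For example, with the hypothetical eight-bit ends 10110000, 11010100 and 11011100, the call returns 11000000 as the fixed position and 11011000 as the variable one: if the middle peer leaves, its successor moves up from 11011000 to 11000000.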

3. Majority Voting

Given the infrastructure provided by the DHT overlay and by the binary tree routing
and change notification protocols, we next compare representative local thresholding and
gossip algorithms. We consider the simplest computation task: a majority vote. Nevertheless,
the two algorithms we choose are good representatives of their respective families. We use
a variant of the local majority voting algorithm of Wolff and Schuster [10] and compare
it to LiMoSense [9], which is a variant of the gossip averaging algorithm of Kempe et al.
[1], suitable for dynamic data. Both algorithms were slightly adapted, and are therefore
described in the following subsections.

The input for both algorithms is a single bit $x_i \in \{0, 1\}$ at each peer $p_i$ and the
computational task is to decide if on average most bits are one or zero. We realistically assume
that the input of the peers can change at any moment, and thus that the algorithm never
terminates. The output of each peer is an ad-hoc assumption on the majority.

3.1. Local majority voting 

In local majority voting, every peer bases its output on the statistics of votes it accepts
from its tree neighbors - namely: UP, CW, and CCW. The peer stores for each of those
neighbors two counter pairs: $X_{UP,i}$ and $X_{i,UP}$ for the upward direction, $X_{CW,i}$ and $X_{i,CW}$
for the clockwise, and $X_{CCW,i}$ and $X_{i,CCW}$ for the counterclockwise direction. Each of
those counter pairs holds the number of votes which are of one and the total number of
votes counted. The counter pair $X_{v,i}$ records the latest message received from direction $v$,
and $X_{i,v}$ the latest message sent to direction $v$. They both are initially (0, 0). We
conveniently denote $X_{\perp,i} = (x_i, 1)$ for the input of $p_i$. The knowledge of a peer is defined
as the sum of all its inputs, $K_i = \sum_{d \in \{UP, CW, CCW, \perp\}} X_{d,i}$. Whenever, according to its
knowledge, the majority is of ones, $(1, -\frac{1}{2}) \cdot K_i > 0$, the peer outputs one. Otherwise,
it outputs zero.

To decide when and which messages it must send, the peer computes for every direction
$d \in \{UP, CW, CCW\}$ the agreement $A_{i,d} = X_{d,i} + X_{i,d}$. A violation occurs
when for a direction $v \in \{UP, CW, CCW\}$ the sign of the agreement disagrees with the
sign of the difference between the knowledge and the agreement:
$(1, -\frac{1}{2}) \cdot A_{i,v} > 0$ when $(1, -\frac{1}{2}) \cdot (K_i - A_{i,v}) < 0$, or
$(1, -\frac{1}{2}) \cdot A_{i,v} < 0$ when $(1, -\frac{1}{2}) \cdot (K_i - A_{i,v}) > 0$.
Such violations can be triggered by initialization, by a change of the peer's vote, or by an
incoming message which changes one of the $X_{d,i}$.

To resolve a violation triggered by the agreement with a neighbor in direction $v$, a peer
can send a message containing information on all of the votes received from neighbors
in other directions. This is done by computing $X_{i,v} \leftarrow K_i - X_{v,i}$ and sending $X_{i,v}$ to the
neighbor in the direction $v$. Notice that after this message is sent,
$A_{i,v} = X_{v,i} + X_{i,v} = K_i$, so the difference $K_i - A_{i,v}$ is zero, which resolves
the violation.
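
The bookkeeping of a single peer can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: pairs are stored as (ones, total), the class and method names are hypothetical, the sign of $(1, -\frac{1}{2}) \cdot X$ is computed as the integer `2 * ones - total` to avoid fractions, and a value of exactly zero is treated as no violation, which follows the strict inequalities above.

```python
DIRS = ("UP", "CW", "CCW")


def delta(pair):
    """Same sign as (1, -1/2) . X for a pair X = (ones, total)."""
    ones, total = pair
    return 2 * ones - total


def add(x, y):
    return (x[0] + y[0], x[1] + y[1])


def sub(x, y):
    return (x[0] - y[0], x[1] - y[1])


class Peer:
    def __init__(self, vote):
        self.own = (vote, 1)                   # X_{perp,i}
        self.recv = {d: (0, 0) for d in DIRS}  # X_{d,i}: latest received
        self.sent = {d: (0, 0) for d in DIRS}  # X_{i,d}: latest sent

    def knowledge(self):
        """K_i: the peer's own vote plus everything received."""
        k = self.own
        for d in DIRS:
            k = add(k, self.recv[d])
        return k

    def agreement(self, v):
        """A_{i,v} = X_{v,i} + X_{i,v}."""
        return add(self.recv[v], self.sent[v])

    def output(self):
        return 1 if delta(self.knowledge()) > 0 else 0

    def violated(self, v):
        a = self.agreement(v)
        rest = sub(self.knowledge(), a)  # K_i - A_{i,v}
        return (delta(a) > 0 and delta(rest) < 0) or \
               (delta(a) < 0 and delta(rest) > 0)

    def make_message(self, v):
        """Compute X_{i,v} <- K_i - X_{v,i}; afterwards A_{i,v} = K_i."""
        self.sent[v] = sub(self.knowledge(), self.recv[v])
        return self.sent[v]
```

A peer that votes one and has heard three zero votes from its UP neighbor outputs zero, detects a violation toward UP, and resolves it by sending the pair (1, 1), after which its agreement with UP equals its knowledge.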

When a neighbor in direction $v$ changes, the pairs $X_{i,v}$ and $X_{v,i}$ no longer reflect
messages sent to or received from the current neighbor. Therefore, when a peer receives an
alert of a change in direction $v$, it sets $X_{v,i}$ to (0, 0) and sends a message to that direction,
which sets $A_{i,v}$ once more to $K_i$. The change detection protocol alerts the new neighbor
as well, so the new neighbor will send a message which reflects its own knowledge.
Once both peers send and accept those messages, $A_{i,v}$ is again equal to $A_{v,i}$ and reflects an
agreement between $p_i$ and its new neighbor.

Note that if $p_i$ does not have a neighbor in direction $v$, then $X_{v,i}$ remains zero and does
not affect $K_i$ or the result. Messages sent by $p_i$ in direction $v$ would be dropped by the
binary tree routing protocol, but this would not be indicated to $p_i$. We prefer wasting those
messages to complicating the protocol with NACK messages. Additionally, note that to
support the possibility of out of order message delivery, a sequence number is attached to
each outgoing message and a message is dropped when it arrives after a message which
was sent subsequently.

Algorithm 3 DHT Local Majority Voting

Input of peer $p_i$: A vote $x_i \in \{0, 1\}$
Data structure of $p_i$: $X_{\perp,i}$, initialized to $(x_i, 1)$; $X_{UP,i}$, $X_{i,UP}$, $X_{CW,i}$,
$X_{i,CW}$, $X_{CCW,i}$, $X_{i,CCW}$, all initialized to (0, 0); seq, $last_{UP}$, $last_{CW}$,
$last_{CCW}$, all initialized to 0.
Output of peer $p_i$: One if $(1, -\frac{1}{2}) \cdot K_i > 0$, zero otherwise.

On change of $x_i$: Set $X_{\perp,i} = (x_i, 1)$ and call test()

On an upcall to ACCEPT with a message (X, seq) from $pos_j$:
  Let $v \in \{UP, CW, CCW\}$ be the direction of $pos_j$ from $pos_i$.
  If seq > $last_v$ then set $X_{v,i} \leftarrow X$, $last_v \leftarrow$ seq, and call test()

On an upcall to ALERT with direction $v$: Set $X_{v,i} \leftarrow (0, 0)$ and call Send(v)

Procedure test():
  For $v \in \{UP, CW, CCW\}$, if $(1, -\frac{1}{2}) \cdot A_{i,v} > 0$ and
  $(1, -\frac{1}{2}) \cdot (K_i - A_{i,v}) < 0$, or $(1, -\frac{1}{2}) \cdot A_{i,v} < 0$ and
  $(1, -\frac{1}{2}) \cdot (K_i - A_{i,v}) > 0$, then call Send(v)

Procedure Send(v):
  Let $X_{i,v} \leftarrow K_i - X_{v,i}$, seq ← seq + 1
  Send a message ($X_{i,v}$, seq) in direction $v$ using binary tree routing.



3.2. Gossip majority voting 

The gossip algorithm we use is a variant of LiMoSense [9]. To simplify the description
and the experiments, we use the failure free version, which does not handle joining
and leaving of peers, or unreliable messaging. We make one important adjustment to the
algorithm: instead of selecting the destination uniformly at random we select uniformly
from among the different destinations in the peer's finger table. This is justified because in
a DHT, following a random finger O(log N) times will lead to a uniformly picked random
peer using just O(log N) messages. A second change, which is semantic more than algorithmic,
is that the output is quantized to either zero or one, in line with the voting problem.
A detailed description of LiMoSense is not included here for lack of space.
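
Since LiMoSense itself is not described here, the sketch below illustrates only the two adjustments named above, applied to plain push-sum averaging in the style of Kempe et al. rather than to LiMoSense. The synchronous rounds, the Chord-like finger layout on peer indices, and all function names are assumptions made for the example; it requires at least two peers.

```python
import random


def fingers(i: int, n: int):
    """Chord-style finger targets of peer i on a ring of n peers (assumed)."""
    out, step = [], 1
    while step < n:
        out.append((i + step) % n)
        step *= 2
    return out


def push_sum_round(state, rng):
    """One synchronous round: each peer keeps half of its (sum, weight)
    pair and pushes the other half to a uniformly chosen finger."""
    n = len(state)
    inbox = [(0.0, 0.0)] * n
    for i, (s, w) in enumerate(state):
        j = rng.choice(fingers(i, n))  # destination drawn from the fingers
        for k in (i, j):
            inbox[k] = (inbox[k][0] + s / 2, inbox[k][1] + w / 2)
    return inbox


def outputs(state):
    """Quantize each peer's running average to a zero or one vote."""
    return [1 if s / w > 0.5 else 0 for s, w in state]


def run(votes, rounds, seed=0):
    rng = random.Random(seed)
    state = [(float(v), 1.0) for v in votes]
    for _ in range(rounds):
        state = push_sum_round(state, rng)
    return state
```

Every round preserves the total sum and the total weight, which is what lets each peer's ratio approach the global average; unlike local majority voting, each peer keeps sending one message per round whether or not its output is still changing.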

4. Experimental validation 

We conducted two sets of experiments to validate the usefulness of our algorithms. The
first experiment evaluates the performance of the binary tree routing protocol in terms of
the efficiency of the tree it induces: the degrees of peers, their depth, and the stretch -
the number of real messages required to send a message from a tree node to its neighbor.
The second compares the local majority voting algorithm, which uses binary tree routing
as the communication infrastructure, to LiMoSense, which does not. The algorithms are
compared in terms of their scalability and response to stationary and non-stationary changes
in the data.



We employed a standard peer-to-peer network simulator, peersim [18]. The simulator
is efficient enough to simulate up to a million peers in some experiments. We used reliable
messaging and random network delays of one to ten simulation cycles. The objective
of the delay is not to approximate wall time but rather to decouple the peers and avoid
locked-step behavior. When using a Chord overlay, we use an existing add-on to peersim.
When using Symmetric Chord [19], we use our own variant, which initializes finger tables
accordingly. All measurements are averaged over ten random experiments, using different
random seeds.

4.1. Tree Properties

We investigated three key properties of the tree induced by the binary tree routing
protocol: the density, the depth, and the stretch. The depth of the tree nodes is the distance
from the root to each of them. The depth is important mostly for applications which use
global communication such as broadcast and convergecast because, for those applications,
the depth is proportional to the delay. As can be seen in Figure 4.1(a), for a tree of N peers,
the first log N - 2 levels tend to be completely full. The largest number of peers are at
the log N level of the tree, and the remainder are at a small additional depth. In none of the
experiments we conducted, even with a million peers, was a peer ever at a depth greater
than log N + 6. We conclude that the tree is extremely well balanced.

The stretch of a routing overlay is the number of actual messages needed to deliver a
message from a peer to its tree neighbor. This metric assumes most of the cost of the
protocol is associated with application level routing decisions (i.e., finding the correct finger,
and so forth) and not with network delays.

Figure 4.1(b) depicts the percentage of neighbors at any given hop distance. It compares
the results for a symmetric Chord network of 10,000 and of 100,000 peers. The results
are nearly identical: 85 percent of the peers are one or two hops away from their tree
neighbors. These results are then contrasted with a (non-symmetric) Chord network of
10,000 peers. In that network the hop distance to a neighbor is a combination of the hop
distance to clockwise neighbors, which is the same as that in symmetric Chord, and the
hop distance to a counterclockwise neighbor, which is the same as the distance between
any two random Chord peers. When using a regular Chord overlay, 75 percent of the tree
neighbors are within a hop distance of seven or less. Although not as good, the average
stretch is still well below log N.

4.2. Majority Voting on DHT

The second set of experiments compares local majority voting, which uses the binary tree
routing protocol, with majority voting based on gossip. We separate the experiments into
those using static votes and those using stationary vote distributions.

4.2.1. Static data 

An experiment with static data emulates a snapshot scenario of peer-to-peer computa- 
tion. In such scenarios, it is assumed that the input is a distributed sample (i.e., snapshot) 
taken at very large intervals - large enough for the algorithm to stabilize between every 
two snapshots. The goal of an algorithm in this state is to stabilize as quickly and using 
as few messages as possible. We leave out experiments with convergence time because 
of the difficulty of comparing the runtime of a cycle-driven algorithm to an event driven 
one, because of space considerations, and because the results of the two algorithms did not 
differ notably in that respect. 

The input of the peers is randomly set with an average $\mu_{pre}$. Once all peers compute the
same output of the majority function, the input of some peers is randomly switched and the
average is set to $\mu_{post}$. At this point, the algorithm proceeds until all of the peers once more
compute the correct result. The number of messages needed to reach this point is reported.
Three very distinct cases arise: $\mu_{pre} < \frac{1}{2} < \mu_{post}$, $\mu_{pre} < \mu_{post} < \frac{1}{2}$, and
$\mu_{post} < \mu_{pre} < \frac{1}{2}$. Other arrangements of $\mu_{pre}$, of $\mu_{post}$, and of $\frac{1}{2}$ are symmetric
because both algorithms have no preference for a majority of ones or of zeros. In the last
of the three cases, convergence is instantaneous in both algorithms because no peer ever
outputs the wrong majority. We therefore focus on the former two cases and experiment with
two main parameters: the scale - number of peers, and the signal - distances of $\mu_{pre}$ from
$\mu_{post}$, and from $\frac{1}{2}$.

Figure 4.2 depicts the number of messages per peer required for each of the algorithms
so that all peers compute the correct majority on networks of 10,000 to 160,000 peers. The
experiments in Figure 4.2(a) depict the first case, with $\mu_{pre}$ and $\mu_{post}$ varied from 10% vs.
90% through to 40% vs. 60%. The most evident outcome of these experiments is that local
majority is by far better than LiMoSense in this metric.

The reason for the difference may be simple: in local majority, it does not take long 
until only a few peers continue to exchange messages. In LiMoSense, as well as in similar 
gossip algorithms, peers continue to send messages periodically until the stopping criterion 
is reached. In this experiment, the stopping criterion is that the last peer has computed the 
correct result. However, gossip would remain inefficient if other stopping criteria, such as 
a fixed number of cycles, or a decrease of variance to some degree, are used. It is the data 
dependency of the local thresholding algorithm which makes the difference.

4.2.2. Stationary data

In an experiment with stationary votes, a number of peers are randomly picked at every
given period and their vote is switched, keeping the overall proportion of zero to one votes
constant. We denote the fraction of peers whose input changes at each average message
delay (five simulation cycles) the noise rate and measure it in peers per million per cycle
(ppm/c).

Figure 4.1: Tree depth and stretch
(a) Distribution of peer depth (b) Number of hops to tree neighbor

Figure 4.2: Messages until convergence with static data

When inputs constantly change, convergence is impossible and convergence cost becomes
meaningless. Instead, it is the proportion of peers which compute the correct outcome
(i.e., the average accuracy) that matters, and the ongoing communication costs required
to preserve this level of correctness. A second question is how well the performance
is preserved when the number of peers in the system grows.

Figures 4.3(a) and 4.3(b) depict the accuracy and cost of local majority voting for networks
of 10,000 to 160,000 peers and at various noise rates. As can be seen, regardless of the noise
rate, both average accuracy and average cost remain constant when the system is scaled up.
Furthermore, even when more than one peer in a thousand changes at every simulator
cycle, the accuracy remains above 90%, and fewer than 2% of the peers send a message at
every simulator cycle.

Finally, Figure 4.3(c) compares the performance of local majority voting to
that of LiMoSense. To compare the two algorithms on equal terms, the message overhead
of LiMoSense is first set to exactly that of local majority voting. Then, LiMoSense is allowed
to send from twice that number to 256 times the number of messages local majority voting
sends. As can be seen, the utility of LiMoSense does not degrade with scale. However,
even when allowed a number of messages which is 256 times (eight doublings) larger than
that of local majority voting, more than twice as many peers err, on average, in LiMoSense.
In terms of utility vs. cost, local majority is overwhelmingly superior.

5. Related Work 

The work described here relates to work in two areas: computation and use of spanning 
trees in DHT overlays, and computation of majority voting in those and other networks. 

Figure 4.3: Scalability of local majority on stationary data
(a) Local majority utility (b) Local majority cost (c) Local vs. Gossip Majority Voting
[Panels (a) and (b): one curve per noise rate (40, 80, 160, 319, 639, and 1280 ppm/c) against
the number of peers. Panel (c): local majority and LiMoSense at message budgets from X1 to
X256 against the number of peers.]

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5.1. Spanning trees in DHT overlays 

Bottom-up trees are discussed as part of the Scribe system [20], in which peers are 
organized in groups and the peer whose address is the closest to the groupld is the root. 
Reverse-path forwarding allows broadcast in Scribe, but a single peer joining a group can 
alter the parenthood relation. 



El-Ansary et al. [15] describe a partition-based broadcast tree that does not assure cycle 
freedom. Huang and Zhang [17] improve on that with a protocol that assures cycle freedom 
but in which peer degrees vary from zero to log N. More recently, Huang and Zhang [16] 
further improved their protocol with balanced DBT, in which the right and left descendants 
of a peer are, respectively, the next peer and the peer responsible for the middle address of 
the address space of the parent. Each peer then distributes the broadcast to half the address 
space of the parent. Balanced DBT offers both a bounded out-degree of two and a stretch 
that is typically one. The binary tree routing protocol presented here further improves on 
balanced DBT by removing the need for global partitioning of the address space. Thus, it 
allows not only broadcast, but also convergecast, or multi-way cycle-free communication, 
which is the way local majority voting uses it. 

5.2. Distributed majority voting 

The computation of the majority has long been a focal point of algorithms intended for 
in-network computation. It was the subject of the first local data mining algorithm [10], and 
it reduces straightforwardly to the push-sum gossip-based protocol of Kempe et al. 
Gossip-based algorithms for majority voting have been proposed [21]. However, they 
address the problem of limiting the space needed by gossip and do not improve the 
messaging overhead beyond push-sum or LiMoSense [9]. 

Birk et al. [22] suggested a local majority voting algorithm for general networks. In 
that work, each "1" vote spans a tree using the Bellman-Ford algorithm until it is either 
nulled by a "0" vote or it runs against the tree of another "1" vote. The work has several 
limitations: Trees are data dependent, so they have to be maintained when the data changes. 
Also, if multiple majority votes are to be taken at once (as often happens in peer-to-peer 
data mining), then different trees are computed for each vote, and expenses accumulate. 
Bellman-Ford also incurs significant synchronization overhead between the branches of 
the tree. In contrast, the local majority voting algorithm described here relies on a binary 
tree protocol which is data independent and only requires (local) maintenance on (local) 
topology changes. 

6. Conclusions and future work 

For almost a decade since gossip-based and local thresholding algorithms were first 
described, the former have remained the more practical and the latter the more theoretically 
efficient. Our binary tree routing protocol begins to bridge the gap. While interesting in 
itself, the protocol is important because it permits seamless execution of any local 
thresholding algorithm on a DHT overlay. The two kinds of algorithms can thus be 
realistically compared. We believe the conclusion of such a comparison is beyond doubt: 
gossip-based algorithms are by far inferior to local thresholding algorithms for computation 
in DHT overlays. 

The biggest challenge remaining is computation of local thresholding algorithms on 
unstructured networks and on networks where communication is noisy and asymmetric. 
Additionally, we see two interesting challenges in implementing the binary tree protocol 
for other structured topologies, and generalizing the protocol for trees of greater degree. 
Such generalization may also serve as a means for controlling communication overhead 
which, although low, is an artifact rather than an argument of current local thresholding 
algorithms. 
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Appendix A. Proofs 

Lemma 6. The address space segment associated with the peers whose positions are in the 
subtree below any peer $p_i$ is continuous. 

Proof. Assume not. Then $p_i$ cannot be the root, because the subtree of the root contains 
all of the peers, which together are associated with the entire address space. Assume, 
without loss of generality, that $p_i$ is in the position $pos_i = p10^k$. For $p_i$ to have a 
discontinuous address space there must be at least three peers $p_{cw}$, $p_m$, and $p_{ccw}$ 
such that $p_{cw}$ is clockwise from $p_m$, which is clockwise from $p_{ccw}$, and such that 
$p_{cw}$ and $p_{ccw}$ are in the subtree of $p_i$ but $p_m$ is not. 
Since $p_{cw}$ is in $p_i$'s subtree, we know its position $pos_{cw}$ begins with the prefix 
$p$, and the same goes for the position $pos_{ccw}$ of $p_{ccw}$. Since each position 
corresponds to an address in its peer's address space segment, and the segment of $p_m$ 
lies between those of $p_{ccw}$ and $p_{cw}$, we know all of the addresses in the address 
space segment of $p_m$ begin with the prefix $p$. One of those addresses corresponds to 
$p_m$'s position $pos_m$, which must therefore begin with the prefix $p$. Whatever that 
position is, by applying the UP operator to it again and again we will reach $pos_i$. Hence, 
$p_m$ is in $p_i$'s subtree, which contradicts the premise. □ 
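
The fact the proof leans on, that binary addresses sharing a common prefix occupy one 
contiguous range of the address space, can be checked on a small example. The sketch 
below is illustrative only: the 6-bit address width, the prefix "101", and the helper names 
are assumptions made for the example and are not part of the protocol. 

```python
def addresses_with_prefix(prefix, bits):
    """Return, in increasing order, every `bits`-wide address whose
    binary representation starts with `prefix` (hypothetical helper)."""
    return [a for a in range(2 ** bits)
            if format(a, "0{}b".format(bits)).startswith(prefix)]


def is_contiguous(addresses):
    """True when the sorted addresses form one gap-free range."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))


BITS = 6  # assumed address width, chosen only to keep the example small
block = addresses_with_prefix("101", BITS)
print(block[0], block[-1], is_contiguous(block))
```

A peer whose segment lies between two addresses of such a block therefore owns only 
addresses that carry the same prefix, which is the step the proof uses. 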

Lemma 7. For any peer $p_i$ whose position is $pos_i$, one of the peers which occupy 
positions in the subtree of position $CW[pos_i]$ (respectively, $CCW[pos_i]$) occupies a 
position which is a fore-parent of the positions of all of the other peers in that subtree. 

Proof. The lemma holds trivially if there are no peers or just one peer in the subtree. 
Assuming there is more than one peer in the subtree, let $pos_p$ be the lowest common 
parent position of the positions of all peers in the subtree. If $pos_p$ is occupied by one of 
those peers, then the lemma is satisfied. Otherwise, $pos_p$ is not occupied by a peer, which 
is possible only if it is in the address space segment of a peer $p_j$ which occupies another 
position $pos_j$. Since $pos_p$ is the lowest common parent, some of the other peers occupy 
positions in $CW[pos_p]$ and some in $CCW[pos_p]$. This means $p_j$ cannot be equal to 
$p_i$, since we know that all the peers are in the subtree of $CW[pos_i]$ (respectively, the 
subtree of $CCW[pos_i]$). We are left with the conclusion that $pos_p$ is in the address 
space segment of a peer not in $p_i$'s subtree. However, this is in violation of Lemma 6. □ 

Lemma 8. The maximal depth of the tree is O (log N) where N is the number of peers. 

Proof. The binary tree is fully defined in terms of addresses regardless of the positions 
actually occupied by peers. The clockwise and the counter-clockwise subtrees of a peer at 
any address span equal address spaces. The peers, on the other hand, are randomly and 
uniformly distributed in the address space. Hence, if the subtree of a peer contains $k$ peers 
then the number of peers in each of its two subtrees is distributed $\mathrm{Bin}(k, \frac{1}{2})$. 

In a binary search tree built from random insertions, the distribution of the number of items 
in every subtree is uniform. It is known that the maximal depth of a random binary search 
tree is roughly $4.3 \log N$. Since the probability that a subtree holds more than half of the 
$k$ nodes by any given margin is higher in a random binary search tree than it is in the tree 
induced by the protocol, the maximal depth of the induced tree is expected to be smaller 
than $4.3 \log N$. □ 
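
The comparison made in the proof can be explored with a small simulation. The sketch 
below is a simplified model rather than the protocol itself: it assumes that every subtree 
root holds exactly one peer, that the remaining peers split uniformly (random binary search 
tree) or as Bin(m, 1/2) (the induced tree), and that the logarithm in the bound is natural. 
The function names, the network size, and the number of trials are likewise assumptions. 

```python
import math
import random


def depth(n, split, rng):
    """Depth of a tree over n peers; split(m, rng) returns how many of the
    m peers below a subtree root fall into its first subtree."""
    deepest, stack = 0, [(n, 1)]
    while stack:
        size, level = stack.pop()
        if size == 0:
            continue
        deepest = max(deepest, level)
        first = split(size - 1, rng)  # assumption: one peer sits at the root
        stack.append((first, level + 1))
        stack.append((size - 1 - first, level + 1))
    return deepest


def uniform_split(m, rng):
    """Subtree size in a random binary search tree: uniform on 0..m."""
    return rng.randint(0, m)


def binomial_split(m, rng):
    """Subtree size when each peer picks a side independently: Bin(m, 1/2)."""
    return sum(rng.random() < 0.5 for _ in range(m))


def mean_depth(n, split, trials, seed):
    rng = random.Random(seed)
    return sum(depth(n, split, rng) for _ in range(trials)) / trials


N = 1024
bst = mean_depth(N, uniform_split, trials=20, seed=0)
induced = mean_depth(N, binomial_split, trials=20, seed=0)
print("random BST: {:.1f}  binomial split: {:.1f}  4.3 ln N: {:.1f}".format(
    bst, induced, 4.3 * math.log(N)))
```

Under these assumptions the binomial split yields noticeably shallower trees than the 
uniform split, in line with the argument above. 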

Lemma 9. The expected stretch of the tree is a small constant. 

Proof. Call the destination address of the first hop the first address, and that of the second 
hop the second address. Any address between that of the initiator and the first and second 
addresses must be part of the subtree of either the initiator or the first and the second peer, 
respectively, because of Lemma 6. If there are more than two hops then the destination 
of the third hop must be in the same address space segment as the first address. Otherwise, 
the first address would be the highest in its address space segment, and thus would be 
occupied by a peer which would become the parent of the initiator. The same is true for 
the destination of the fourth hop, if there is one, and the second address. We conclude that 
if a message in the UP direction makes more than two hops then those hops are between 
two distinct peers, the first and the second one, and that eventually one of those peers must 
be the parent of the initiator. 

The distance between the first and the second addresses must be larger than the address 
space segment of the initiator. The distance between the first and the third destinations is 
at least as large. If the third destination is not the parent's address then the address space 
segment which includes both the first and the third destinations must be at least three times 
larger than the address space segment of the initiator. Respectively, if the fourth hop is not 
the last then the address space segment of the second peer must be at least seven times 
larger than the initiator's address space. In general, if the message hops $k > 2$ times then 
the address space segment of both the first and the second peer must be at least $2^{k-2} - 1$ 
times larger than that of the initiator. 

It is known [23] that the lengths of uniform random segments are exponentially 
distributed. Given that the size of an address space segment is $c$, the probability that the 
size of the consecutive segments is $c \cdot 2^k$ is the probability of sampling both values 
from the exponential distribution, $\Pr = \left(1 - e^{-c\lambda}\right) e^{-c 2^k \lambda}$. For 
any constant $c$, this probability decreases double exponentially in $k$. We conclude that 
the expected number of hops between a peer and its parent is a constant not much greater 
than three. □ 
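
To give a feel for how quickly the probability in the last step shrinks, the sketch below 
evaluates the product of the two exponential terms for growing k. The values chosen for 
the segment size c and for the rate of the exponential distribution (both set to 1) are 
arbitrary assumptions for illustration, as is the function name. 

```python
import math


def segment_probability(c, k, rate=1.0):
    """(1 - exp(-c * rate)) * exp(-c * 2**k * rate), the expression used in
    the proof; `rate` is the rate of the exponential distribution."""
    return (1.0 - math.exp(-c * rate)) * math.exp(-c * (2 ** k) * rate)


# Assumed values: c = 1 and rate = 1, chosen only for illustration.
values = [segment_probability(1.0, k) for k in range(1, 7)]
for k, value in enumerate(values, start=1):
    print(k, value)
```

Each increment of k multiplies the previous value by exp(-c * 2^k * rate), so the 
successive ratios themselves shrink exponentially, which is the double exponential decay 
the proof relies on. 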

Lemma 10. The addition or removal of a peer $p_i$ can affect the tree connectivity of at 
most five peers, all of which are tree neighbors of either $p_i$ or its successor $p_{i+1}$. 

Proof. When $p_i$ is added, the address space segment of its successor $p_{i+1}$ is divided 
between $p_i$ and $p_{i+1}$. One of the peers receives the address which previously 
corresponded with $pos_{i+1}$. Clearly, if that peer is $p_i$ then the connectivity of the 
peers which previously were the parent and direct descendants of $p_{i+1}$ changes, since 
$p_i$ now replaces $p_{i+1}$ as their neighbor. The other peer, be it $p_i$ or $p_{i+1}$, 
receives a new position. Call this peer $p_{new}$ and its position $pos_{new}$. Prior to the 
addition of $p_i$, $pos_{new}$ was not occupied by a peer because the corresponding address 
was part of the address space of $p_{i+1}$, and that address space included a higher 
position, $pos_{i+1}$. The addresses between the one corresponding to $pos_{new}$ and the 
one corresponding to $pos_{i+1}$ are all in the address space of either $p_i$ or $p_{i+1}$. 
Those addresses all correspond to positions which are lower than the positions occupied by 
the two peers. Therefore, $p_{new}$ can have at most one descendant. When a message was 
previously routed up from that possible descendant, it was routed to $pos_{new}$ and then 
forwarded further up because $pos_{new}$ was part of an address space belonging to a peer 
with a different position. Therefore, whichever peer now accepts messages sent up from 
$p_{new}$, that peer was the previous parent of $p_{new}$'s sole possible descendant. 
Because no other address space segment changes, no other peer changes its position. Since 
we have already enumerated the neighbors of $p_i$ and $p_{i+1}$, the connectivity of any 
peer other than those neighbors does not change. □ 
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